Testing Structural Properties in Textual Data: Beyond Document Grammars

Authors

  • Felix Sasaki
  • Jens Pönninghaus
Abstract

This article describes research carried out in the project "Secondary information structuring and comparative discourse analysis" (SEKIMO), which is part of the research group "Texttechnological modeling of information" and is funded by the German Research Council (DFG). In our project, we use XML document grammars, i.e. DTDs (Bray et al., 2000), XML Schema (Thompson et al., 2001) and Relax NG (Clark and Murata, 2001), to formalize and interrelate linguistic phenomena in typologically diverse languages. The document grammars differ in what they describe (morphosyntactic structures, semantic relations and discourse functions) and in the granularity of the description: some are specific to a language or dialogue type, while others are of a more general kind. At the level of secondary information structuring, we interrelate the document grammars, sometimes creating 'intermediate' document grammars to connect the specific and general levels of linguistic description. All document grammars are developed on the basis of, and applied to, dialogue and text corpora in different languages. (For more information about the project, see www.text-technology.de.) Schema languages usually define grammatical constraints on document structures, i.e. hierarchical relations between elements in a tree-like structure. Because there are several limitations in implementing appropriate document grammars, it seems useful, especially but not only for the linguistic phenomena we want to describe, to complement hierarchical validation with a methodology for defining and applying other structural constraints.
The main benefits of this methodology are:

  • Addition of constraints which are hard to express using schema languages
  • Independent formulation of constraints; adding new constraints does not require changes to the document schema
  • Classification of information items; classes are assigned based on the fulfillment of constraints
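
The idea of constraints that live outside the document grammar can be illustrated with a minimal Python sketch. It is not the authors' implementation; the element names (`dialogue`, `utterance`, `token`), the `pos` attribute, and the two constraint names are hypothetical and chosen only for illustration. Each constraint is an independent predicate over an element, and each element is classified by the set of constraints it fulfills.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; element and attribute names are illustrative only.
SAMPLE = """
<dialogue>
  <utterance id="u1"><token pos="N">tea</token><token pos="V">drink</token></utterance>
  <utterance id="u2"><token pos="N">yes</token></utterance>
  <utterance id="u3"/>
</dialogue>
"""

def has_tokens(elem):
    """Constraint: the element contains at least one <token> child."""
    return elem.find("token") is not None

def ends_with_verb(elem):
    """Constraint: the last <token> child is tagged as a verb (pos="V")."""
    tokens = elem.findall("token")
    return bool(tokens) and tokens[-1].get("pos") == "V"

# Constraints are registered by name, separately from any schema.
# Adding a new entry here requires no change to a DTD or schema.
CONSTRAINTS = {
    "non_empty": has_tokens,
    "verb_final": ends_with_verb,
}

def classify(root, tag, constraints):
    """Map each element's id to the set of constraint names it fulfills."""
    result = {}
    for elem in root.iter(tag):
        result[elem.get("id")] = {
            name for name, check in constraints.items() if check(elem)
        }
    return result

classes = classify(ET.fromstring(SAMPLE), "utterance", CONSTRAINTS)
for uid, fulfilled in classes.items():
    print(uid, sorted(fulfilled))
```

In this sketch, an ordering condition such as "the last token is a verb" is stated as a plain predicate rather than as part of a content model, and the resulting sets of fulfilled constraints serve as the classes assigned to each information item.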

Similar resources

Information Retrieval from Structured Documents Represented by Attribute Grammars

This paper presents a system for Information Retrieval (IR) from collections of structured documents represented by Attribute Grammars (AG). Each document corresponds to a syntactic tree with nodes decorated with sets of attributes. The values of these attributes correspond to characteristics which specify the semantics of the textual content and the structure in order to perform IR. First, we ...

Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval

This paper presents an ongoing work to uniformly represent structured documents by mean of Attribute Grammars (AG). Each document corresponds to a syntactic tree with nodes decorated with sets of attributes. The values of these attributes correspond to characteristics which specify the semantics of both the textual content and the structural elements. We show how to use this representation for ...

XML Content Management Based on Object-Relational Database Technology

XML (Extensible Markup Language) is a textual markup language designed for the creation of self-describing documents. Such documents contain textual data combined with structural information describing the structure of the textual data. Currently, products and approaches for document-oriented application domains focus mainly on the textual representation when processing and analyzing documents....

Querying Large Collections of Semistructured Data

An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data. We focus first on mathematics retrieval, which is appealing in various dom...

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

Journal title:
  • LLC

Volume 18  Issue 

Pages  -

Publication date 2003